Resource-Elasticity Support for Distributed Memory HPC Applications
Abstract
Computer simulations are an alternative to the experimental scientific method in domains where physical experiments are infeasible or impossible. When the required memory and processing speed are large, simulations are executed on distributed memory High Performance Computing (HPC) systems. These systems are usually shared among their users, and a resource manager with a batch scheduler is used to share their resources fairly and efficiently. Current large HPC systems have thousands of compute nodes connected over a high-performance network. Users submit batch job descriptions that specify the amount of resources required by their simulations. Batch job descriptions are queued and scheduled based on priorities and submission times. The parallel efficiency of a simulation depends on the amount of resources allocated to it, and it is challenging for users to specify allocation sizes that produce adequate parallel efficiencies. If an allocation is too small, the parallel efficiency of the application may be adequate, but its performance is not scaled to its maximum potential. If an allocation is too large, the parallel efficiency may be degraded by synchronization overheads. Unfortunately, in current systems these resource allocations cannot be adapted once the applications of a job have started. This work proposes a resource manager and MPI library combination that adds resource-elasticity support for HPC applications. The resource manager is extended with operations to adapt the resources of running applications in jobs, together with new scheduling techniques. The MPI library is extended with operations that enable resource adaptations, expressed as changes in the number of processes in world communicators. The goal is to optimize system-wide efficiency metrics through adjustments to the resource allocations of running applications. Resource allocations are adjusted continuously based on performance feedback from running applications.
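The trade-off between allocation size and parallel efficiency described in the abstract can be illustrated with a small sketch. It uses the standard definition of parallel efficiency (speedup divided by process count) and a purely hypothetical runtime model in which synchronization cost grows linearly with the number of processes. The function names, the cost model, and the 0.8 threshold are assumptions made for illustration; they are not taken from the proposed resource manager or MPI library.

```python
def parallel_efficiency(t_serial, t_parallel, num_procs):
    """Standard definition: speedup (t_serial / t_parallel) divided by process count."""
    return t_serial / (num_procs * t_parallel)


def modeled_runtime(num_procs, work=1000.0, sync_cost=0.05):
    """Hypothetical cost model (an assumption, not from the source):
    perfectly divisible work plus a synchronization term that grows
    linearly with the number of processes."""
    return work / num_procs + sync_cost * num_procs


def largest_efficient_allocation(candidates, threshold=0.8):
    """Return the largest candidate allocation whose modeled parallel
    efficiency stays at or above the threshold; fall back to the smallest
    candidate if none qualifies."""
    t_serial = modeled_runtime(1)
    efficient = [
        p for p in candidates
        if parallel_efficiency(t_serial, modeled_runtime(p), p) >= threshold
    ]
    return max(efficient) if efficient else min(candidates)


if __name__ == "__main__":
    sizes = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024 processes
    for p in sizes:
        eff = parallel_efficiency(modeled_runtime(1), modeled_runtime(p), p)
        print(f"{p:5d} processes: efficiency {eff:.3f}")
    print("largest efficient allocation:", largest_efficient_allocation(sizes))
```

Under this toy model, small allocations are efficient but slow, while large ones lose efficiency to synchronization. A user has to guess this balance in advance when sizes are fixed at submission, whereas an elastic system could in principle adjust the allocation at run time using measured performance.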
Similar resources
Distributed Software Transactional Memories: A Summary of Research
Distributed Transactional Memory (DTM) aims at introducing a novel programming paradigm combining the simplicity of Transactional Memory (TM)[11] with the scalability and failure resiliency achievable by exploiting the resource redundancy of distributed platforms. These features make the DTM model particularly attractive for inherently distributed application domains such as Cloud computing or ...
Towards Next Generation Resource Management at Extreme-Scales
With the exponential growth of distributed systems in both FLOPS and parallelism (number of cores/threads), scientific applications are growing more diverse with various workloads. These workloads include traditional large-scale high performance computing (HPC) MPI jobs, and HPC ensemble workloads that support the investigation of parameter sweeps using many small-scale coordinated jobs, as wel...
Invasive Computing in HPC with X10 (Hans-Joachim Bungartz, Christoph Riesinger, Martin Schreiber, Andreas Zwinkau)
High performance computing with thousands of cores relies on distributed memory for memory consistency reasons. Resource management on such systems usually relies on static assignment of resources at the start of each application. Such static scheduling is incapable of starting applications whose required resources are being used by others since a reduction of resources assigned to applicat...
Framework for Enhancing the Performance of Data Intensive MPI based HPC applications on Cloud
Corresponding author: Ashwini Janagal Padmanabha, Nitte Meenakshi Institute of Technology, Bangalore, Karnataka, India. Abstract: Cloud computing is a new technology that is revolutionizing the current business model with a pay-per-usage resource provisioning method. This model proves to be more profitable compared to traditional resource procurement and maintenanc...